Chromatin Immunoprecipitation Sequencing ◾ 223
ENCFF000XJP_chp1_filt_so.bam \
-p chr17:56084561-56084661 \
../ref/hg19.fa
The “samtools tview” provides a quick way to visualize the alignments in a BAM file and
to inspect the binding site in a specific location. However, when you need to view a specific
gene, you may need to inspect the BAM file to see how the chromosomes are named. You
can also find gene coordinates in the reference genome from the NCBI Gene database.
Viewing a BAM file is not necessary unless you need to inspect the alignments.
The next step is to use one of the peak-calling programs or shortly a peak caller that uses
a ChIP-Seq BAM file and a control BAM file as inputs to perform the process of peak call-
ing as discussed above. There are several peak callers available as open source, but we will
use the oldest and the most commonly used one, MACS3.
6.3.4 ChIP-Seq Peak Calling with MACS3
MACS (Model-based Analysis of ChIP-Seq) [6, 7] is one of the popular peak callers,
suitable for identifying TF binding sites. MACS utilizes library complexity to evaluate
the significance of enriched ChIP peaks by modeling the distribution of reads along the
genome using Poisson distribution and then it estimates the FDR for each detected peak
empirically. MACS can be used for peak calling using only ChIP-Seq data without pro-
viding control or with a control sample (input reads). Sliding window of different sizes is
used for finding peaks by calculating the expected number of reads aligned to that region.
MACS can perform removal of redundant reads and read shifting. It generates a number
of statistics for peak calling.
Basically, MACS separates upstream and downstream aligned reads and aligns them
by the midpoint between their peaks. Assume that the distance between the two peaks is
d bp, the all reads will be shifted by d/2 bp toward 3′ ends to cover the most likely DNA
protein-binding sites. MACS models read distribution along the genome by Poisson distri-
bution to perform peak calling and to provide p-value. After read shifting, the algorithm
uses a sliding window of 2d bp to identify candidate peaks with a significant enrichment.
It also merges the overlapping peaks and extends reads d bases from their center. The posi-
tion with the highest fragment pileup is called the summit, which is identified as the exact
location of the protein-binding site in the genome. A Poisson distribution parameter λ is
estimated for each peak and peaks with p-values less than a threshold will be called. The
ratio between the ChIP-Seq read count and λ is estimated as the fold enrichment.
For installation instruction, visit “https://github.com/macs3-project/MACS”. On Linux,
you can install MACS3 using “pip install macs3” and then you can test it by running
“macs3 callpeak -h” to display the help. In the following, we will use MACS3 to analyze the
three ChIP-Seq BAM files while using the corresponding control BAM file. Each “macs3”
command requires a ChIP-Seq BAM file “-t” and a control BAM file “-c”. The “-f” option
specifies the file format of the input file, “-g” option specifies the effective genome size,
“-n” option specifies the string that is prefixed to the output files, “-q” option specifies the
FDR threshold, “--bdg” option is to generate a bedGraph file, which is a format that allows